-
Notifications
You must be signed in to change notification settings - Fork 458
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(test suite): use cgroups to detect if a test leaks processes #6470
base: main
Are you sure you want to change the base?
Conversation
Tested manually by commenting out NeonEnv.stop()'s self.attachment_service.stop() call.
9a4d27b
to
4204a7d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is looking great!
If we need to re-run until we see a coverage failure, this might be handy: fcf17a3 |
Permission problems:
|
No tests were run or test report is not availableTest coverage report is not availableThe comment gets automatically updated with the latest test results
2978c83 at 2024-01-26T14:11:42.442Z :recycle: |
Looking great in https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6470/7660119788/index.html#suites/90de3f9cafdc78be9db0b2ada81f7c26/ea7eab72b6180321:
Was this an injected failure? No. |
It looks the similar thing as we discovered in #6270 (comment) — when endpoint start fails it leaves stray compute_ctl for some time |
If endpoint startup fails, probably we could just leave out the sync-safekeepers because no LSN should had changed, but perhaps that's a difficult call to make.. Alternatively that particular assert could be formulated as "basebackup fails". Well, looking at control_plane/src/endpoint.rs this is quite clearly a problem of just not waiting for the spawned postgres to stop, similarly to #6474 but we probably cannot just kill it but let the sync safekeepers go through, as we cannot know that it was basebackup which failed. |
Dockerfile.buildtools
Outdated
# Add nonroot user | ||
RUN useradd -ms /bin/bash nonroot -b /home | ||
SHELL ["/bin/bash", "-c"] | ||
RUN echo "ALL ALL = (ALL) NOPASSWD: ALL" >> /etc/sudoers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TOOD: security review
.github/workflows/build_and_test.yml
Outdated
@@ -420,19 +420,22 @@ jobs: | |||
container: | |||
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/build-tools:${{ needs.build-buildtools-image.outputs.build-tools-tag }} | |||
# Default shared memory is 64mb | |||
options: --init --shm-size=512mb | |||
options: --init --shm-size=512mb --cgroupns=private --privileged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO security review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Asked for review in Slack https://neondb.slack.com/archives/C059ZC138NR/p1706265241134019
This reverts commit 800d3d1.
Dropping the same comment I dropped on Slack:
I suggest we implement an alternative - this increases privilege escalation risks majorly. It provides almost the same capabilities for the container as the host machine. There are so many ways to exploit this, implementing this would pose a big security threat. (via container escape, privilege escalation to cgroup manipulation, execute arbitrary commands, lateral movements etc. etc.)
Like the above scenario, if an attacker gain access to the container, they can simply abuse the sudo privileges from the non-root user. This bypasses any security best practices. This change does NOT pass security review - please implement an alternative. @problame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can provide sudo privileges to the specific commands only. We should definitely avoid sudo on nonroot user as this will basically make any non-root users run with root rights.
Instead of --privileged, why not use more fine-grained capabilities? Or perhaps use --mount on the necessary filesystems?
So, here's my hack to remount the cgroupfs in the container as Create a docker container.
So, the bash process is PID 1 in the container's pid namespace.
So, PID 1 in the container is PID
Now our
But it cannot operate on the root cgroup because it's owned by
|
Epic: #6485
Problem
Some tests leave stay processes behind after they exit.
This is the potential root cause for failed coverage-report generation, as well as other flakiness.
Solution
Before executing a test, create & enter a cgroup.
After the test is done, ensure that there are no processes left in the cgroup.
This banks on the assumption that the tests themselves don't do anything with the cgroup hierarchy, which is currently the case.
Changes
__enter__
and__exit__
context manager hooks to implement the solution outlined above.vanilla_pg
andneon_env_builder
in one test where we saw failures=> follow-up stray process check: move out of NeonEnvBuilder into an autouse fixture #6487
test_neon_two_primary_endpoints_fail
; it doesn't add toNeonEnv.endpoints
,hence it didn't get stopped as part of
NeonEnvBuilder.__exit__
's call toNeonEnv.stop
.Follow-Ups